
Support RL online quantization with torchao #23014

Merged
vllm-bot merged 1 commit into vllm-project:main from jerryzh168:torchao-on-the-fly-quant on Oct 1, 2025

Conversation

@jerryzh168 (Contributor) commented on Aug 15, 2025

Summary:
This enables online quantization for verl. The PR adds support for initializing a TorchAOConfig object in vLLM either from a serialized JSON file that specifies the desired quantization, or from a JSON-serialized TorchAOConfig object passed directly.

Code for serializing the config to JSON:

```python
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow
from torchao.core.config import config_to_dict
import json

config = Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())
json_str = json.dumps(config_to_dict(config))

LLM(..., quantization="torchao", hf_overrides={"quantization_config_dict_json": json_str})
```

Code for serializing the config to a file:

```python
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, PerRow
from torchao.core.config import config_to_dict
import json

config = Float8DynamicActivationFloat8WeightConfig(granularity=PerRow())

with open("torchao_config.json", "w") as f:
    f.write(json.dumps(config_to_dict(config)))

LLM(..., quantization="torchao", hf_overrides={"quantization_config_file": "torchao_config.json"})
```
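To make the round trip concrete, here is a minimal stdlib-only sketch of what the config file contains and how a consumer reads it back. The dict shape below (a `_type` key plus fields) is a placeholder standing in for the output of `config_to_dict`, not torchao's exact serialization format:

```python
import json
import tempfile

# Placeholder dict standing in for config_to_dict(config) output;
# torchao's real serialized layout may differ.
config_dict = {
    "_type": "Float8DynamicActivationFloat8WeightConfig",
    "granularity": "PerRow",
}

# Serialize to a file, as in the example above.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(config_dict, f)
    path = f.name

# What a consumer of quantization_config_file would do: read plain JSON back.
with open(path) as f:
    loaded = json.load(f)

assert loaded == config_dict
```

Since the file is plain JSON, it can also be produced by any tool, not just Python, as long as it matches the layout `config_to_dict` emits.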

This also supports module-level configuration through ModuleFqnToConfig (https://huggingface.co/docs/transformers/main/en/quantization/torchao#per-module-quantization), although that path is not tested yet.

More configs: https://docs.pytorch.org/ao/main/api_ref_quantization.html#inference-apis-for-quantize
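To illustrate the per-module idea, the sketch below shows a hypothetical JSON mapping from module fully-qualified names (FQNs) to per-module config descriptors. The key names and the `_default` convention here are illustrative only; the actual serialization that ModuleFqnToConfig produces via `config_to_dict` may look different:

```python
import json

# Hypothetical per-module mapping: FQN -> config descriptor.
# "_default" stands for "all other modules"; None means leave unquantized.
# This mirrors the idea of ModuleFqnToConfig, not torchao's exact format.
module_fqn_to_config = {
    "_default": {"_type": "Float8DynamicActivationFloat8WeightConfig"},
    "model.layers.0.self_attn.q_proj": None,
}

# The mapping round-trips through JSON like any other config dict.
json_str = json.dumps(module_fqn_to_config)
roundtrip = json.loads(json_str)
assert roundtrip == module_fqn_to_config
```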

Note: this incorporates changes from @LiyuanLucasLiu's PR #23901. The vLLM fp8 quant method is not supported yet; we can add that in a separate PR.

Test Plan:

```
pytest tests/quantization/test_torchao.py -k test_on_the_fly_quant
pytest tests/quantization/test_torchao.py -k test_reload_weights
```

and regression tests:

```
pytest tests/quantization/test_torchao.py
```


@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, a small and essential subset of CI tests that quickly catches errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a mechanism to initialize TorchAOConfig from a file, which is a great step towards enabling on-the-fly quantization. The changes span configuration, the torchao quantization layer, and weight loading utilities. While the overall direction is good, I've identified a few critical and high-severity issues. There's a significant logic bug in weight_utils.py that seems to prevent the feature from working on models that are not already quantized. Another critical issue is that dummy weight initialization for profiling has been commented out, which will likely break profiling runs. Additionally, I've pointed out a couple of high-severity issues in the new torchao code related to a potential TypeError from an unsafe method signature and a hardcoded dtype marked as a "temp hack". I've provided specific suggestions to address each of these points.

jerryzh168 added a commit to jerryzh168/verl that referenced this pull request Aug 15, 2025
Summary:
Only supports quantizing all linear layers with a torchao config for now. See the vLLM PR for how to generate the quantization file.
Also requires vllm changes: vllm-project/vllm#23014

Test Plan:
sh examples/ppo_trainer/run_deepseek7b_llm.sh
@jerryzh168 jerryzh168 marked this pull request as draft August 16, 2025 00:01
jerryzh168 added a commit to jerryzh168/verl that referenced this pull request Aug 16, 2025
@jerryzh168 (Author) commented:

waiting on verl to confirm the API changes make sense first, before cleaning up this PR for review

@jerryzh168 force-pushed the torchao-on-the-fly-quant branch 2 times, most recently from 9a3bf05 to c8b5d20 on August 27, 2025
@jerryzh168 force-pushed the torchao-on-the-fly-quant branch from c8b5d20 to 7548891 on September 17, 2025
@jerryzh168 changed the title from "Allows initialize TorchAOConfig object through quantization_config_file" to "Support on the fly quantization with torchao" on Sep 17, 2025
@jerryzh168 force-pushed the torchao-on-the-fly-quant branch 4 times, most recently from a0f395b to bf57db6 on September 18, 2025
@jerryzh168 requested a review from 22quinn on September 18, 2025
@jerryzh168 force-pushed the torchao-on-the-fly-quant branch from bf57db6 to 8538d42 on September 18, 2025
@jerryzh168 force-pushed the torchao-on-the-fly-quant branch from 2b3cfe0 to b073702 on October 1, 2025
@jerryzh168 (Author) commented:

Can't repro the quantization test timeout locally; rebasing and running the tests again to see if it persists.

@jerryzh168 (Author) commented:

OK, quantization tests passed. The Language Model Tests are failing, but I don't think they are related to these changes.

I also saw the Language Model Tests failing on main: https://buildkite.com/vllm/ci/builds/33089/steps/canvas

I think it's safe to merge now.

@vllm-bot vllm-bot merged commit c312468 into vllm-project:main Oct 1, 2025
49 of 52 checks passed
jerryzh168 added a commit to jerryzh168/verl that referenced this pull request Oct 2, 2025
pdasigi pushed a commit to pdasigi/vllm that referenced this pull request Oct 2, 2025
Signed-off-by: Jerry Zhang <jerryzh168@gmail.com>
jerryzh168 added a commit to jerryzh168/verl that referenced this pull request Oct 3, 2025
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
Signed-off-by: Jerry Zhang <jerryzh168@gmail.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
tomeras91 pushed a commit to tomeras91/vllm that referenced this pull request Oct 6, 2025
Signed-off-by: Jerry Zhang <jerryzh168@gmail.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
@22quinn mentioned this pull request Oct 9, 2025
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
Signed-off-by: Jerry Zhang <jerryzh168@gmail.com>
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
Signed-off-by: Jerry Zhang <jerryzh168@gmail.com>
jerryzh168 added a commit to jerryzh168/verl that referenced this pull request Nov 3, 2025
jerryzh168 added a commit to jerryzh168/verl that referenced this pull request Nov 3, 2025
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
Signed-off-by: Jerry Zhang <jerryzh168@gmail.com>
jerryzh168 added a commit to jerryzh168/verl that referenced this pull request Nov 13, 2025
jerryzh168 added a commit to jerryzh168/verl that referenced this pull request Nov 13, 2025
jerryzh168 added a commit to jerryzh168/verl that referenced this pull request Nov 13, 2025
jerryzh168 added a commit to jerryzh168/verl that referenced this pull request Nov 19, 2025
jerryzh168 added a commit to jerryzh168/verl that referenced this pull request Nov 19, 2025
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
Signed-off-by: Jerry Zhang <jerryzh168@gmail.com>
andrewor14 added a commit to andrewor14/unsloth that referenced this pull request Dec 11, 2025
**Summary:** Existing support for `load_in_fp8=True` performs
an offline quantization when loading the initial model.
This is no longer necessary as of vllm==0.12.0 (after
vllm-project/vllm#23014), where we
can quantize the model on-the-fly when we load it:

```python
llm = LLM(
  ...
  hf_overrides={
    "quantization_config_dict_str": json.dumps(torchao_config),
  },
)
```

**Test Plan:**
https://gist.github.com/andrewor14/5b85119fae46845d07b608d420907423
Labels: ready
6 participants